Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')Project Report for STAT 303-1 Fall 2025
This study integrates four analytical approaches to assess how climate change influences crop yields and broader economic outcomes. First, these analytical questions allow for the study of how climate and agriculture vary over time and on a regional level. Across analyses, temperature patterns and extreme weather events consistently emerged as the primary drivers of agricultural performance. While global temperature trends showed no strong temporal patterns, regional analyses revealed substantial geographic variation, indicating that climate impacts are more meaningfully interpreted at the regional scale. Furthermore, the analysis questions worked to answer how well climate change variables correlated with crop yield and how they affect the economy. A key global finding shows that crop yields peak within an average temperature range of 10–20°C. Activity of Very High
extreme weather event levels has also increased in recent years, is most frequent in the United States, and coincides with relatively higher U.S. crop yields under severe conditions. Economic impact also showed a strong positive linear relationship with crop yield. These findings together highlight the need for region-specific climate policies and agricultural investments that enhance crop and soil resilience against climate change events, which can overall have an impact on economic welfare.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')# Describing the Data Source
data = pd.read_csv('climate_change_impact_on_agriculture_2024.csv')
data.head()| Year | Country | Region | Crop_Type | Average_Temperature_C | Total_Precipitation_mm | CO2_Emissions_MT | Crop_Yield_MT_per_HA | Extreme_Weather_Events | Irrigation_Access_% | Pesticide_Use_KG_per_HA | Fertilizer_Use_KG_per_HA | Soil_Health_Index | Adaptation_Strategies | Economic_Impact_Million_USD | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2001 | India | West Bengal | Corn | 1.55 | 447.06 | 15.22 | 1.737 | 8 | 14.54 | 10.08 | 14.78 | 83.25 | Water Management | 808.13 |
| 1 | 2024 | China | North | Corn | 3.23 | 2913.57 | 29.82 | 1.737 | 8 | 11.05 | 33.06 | 23.25 | 54.02 | Crop Rotation | 616.22 |
| 2 | 2001 | France | Ile-de-France | Wheat | 21.11 | 1301.74 | 25.75 | 1.719 | 5 | 84.42 | 27.41 | 65.53 | 67.78 | Water Management | 796.96 |
| 3 | 2001 | Canada | Prairies | Coffee | 27.85 | 1154.36 | 13.91 | 3.890 | 5 | 94.06 | 14.38 | 87.58 | 91.39 | No Adaptation | 790.32 |
| 4 | 1998 | India | Tamil Nadu | Sugarcane | 2.19 | 1627.48 | 11.81 | 1.080 | 9 | 95.75 | 44.35 | 88.08 | 49.61 | Crop Rotation | 401.72 |
#Quick Analysis:
print('number of observations',len(data))
print()
display('list of columns:',list(data.columns))
print()
display('data shape',data.shape)
print()
display(data.describe())
print()
display(data.dtypes)number of observations 10000
'list of columns:'
['Year',
'Country',
'Region',
'Crop_Type',
'Average_Temperature_C',
'Total_Precipitation_mm',
'CO2_Emissions_MT',
'Crop_Yield_MT_per_HA',
'Extreme_Weather_Events',
'Irrigation_Access_%',
'Pesticide_Use_KG_per_HA',
'Fertilizer_Use_KG_per_HA',
'Soil_Health_Index',
'Adaptation_Strategies',
'Economic_Impact_Million_USD']
'data shape'
(10000, 15)
| Year | Average_Temperature_C | Total_Precipitation_mm | CO2_Emissions_MT | Crop_Yield_MT_per_HA | Extreme_Weather_Events | Irrigation_Access_% | Pesticide_Use_KG_per_HA | Fertilizer_Use_KG_per_HA | Soil_Health_Index | Economic_Impact_Million_USD | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 |
| mean | 2007.088700 | 15.241299 | 1611.663834 | 15.246608 | 2.240017 | 4.980900 | 55.248332 | 24.955735 | 49.973708 | 64.901278 | 674.269658 |
| std | 10.084245 | 11.466955 | 805.016815 | 8.589423 | 0.998342 | 3.165808 | 25.988305 | 14.490962 | 28.711027 | 20.195882 | 414.591431 |
| min | 1990.000000 | -4.990000 | 200.150000 | 0.500000 | 0.450000 | 0.000000 | 10.010000 | 0.000000 | 0.010000 | 30.000000 | 47.840000 |
| 25% | 1999.000000 | 5.430000 | 925.697500 | 7.760000 | 1.449000 | 2.000000 | 32.677500 | 12.527500 | 25.390000 | 47.235000 | 350.545000 |
| 50% | 2007.000000 | 15.175000 | 1611.160000 | 15.200000 | 2.170000 | 5.000000 | 55.175000 | 24.930000 | 49.635000 | 64.650000 | 583.920000 |
| 75% | 2016.000000 | 25.340000 | 2306.997500 | 22.820000 | 2.930000 | 8.000000 | 77.582500 | 37.470000 | 74.825000 | 82.472500 | 917.505000 |
| max | 2024.000000 | 35.000000 | 2999.670000 | 30.000000 | 5.000000 | 10.000000 | 99.990000 | 49.990000 | 99.990000 | 100.000000 | 2346.470000 |
Year int64
Country object
Region object
Crop_Type object
Average_Temperature_C float64
Total_Precipitation_mm float64
CO2_Emissions_MT float64
Crop_Yield_MT_per_HA float64
Extreme_Weather_Events int64
Irrigation_Access_% float64
Pesticide_Use_KG_per_HA float64
Fertilizer_Use_KG_per_HA float64
Soil_Health_Index float64
Adaptation_Strategies object
Economic_Impact_Million_USD float64
dtype: object
# Numeric and Categorical Variables:
numeric_cols = data.select_dtypes(include='number').columns.tolist()
categorical_cols = data.select_dtypes(include=object).columns.tolist()
display('Numerical Columns:',numeric_cols)
print('Number of Numeric Columns:', len(numeric_cols))
print()
display('Categorical Columns:',categorical_cols)
print('Number of Numeric Columns:', len(categorical_cols))'Numerical Columns:'
['Year',
'Average_Temperature_C',
'Total_Precipitation_mm',
'CO2_Emissions_MT',
'Crop_Yield_MT_per_HA',
'Extreme_Weather_Events',
'Irrigation_Access_%',
'Pesticide_Use_KG_per_HA',
'Fertilizer_Use_KG_per_HA',
'Soil_Health_Index',
'Economic_Impact_Million_USD']
Number of Numeric Columns: 11
'Categorical Columns:'
['Country', 'Region', 'Crop_Type', 'Adaptation_Strategies']
Number of Numeric Columns: 4
# Any missing values in the dataset?
missing = data.isnull().sum()
print(missing)Year 0
Country 0
Region 0
Crop_Type 0
Average_Temperature_C 0
Total_Precipitation_mm 0
CO2_Emissions_MT 0
Crop_Yield_MT_per_HA 0
Extreme_Weather_Events 0
Irrigation_Access_% 0
Pesticide_Use_KG_per_HA 0
Fertilizer_Use_KG_per_HA 0
Soil_Health_Index 0
Adaptation_Strategies 0
Economic_Impact_Million_USD 0
dtype: int64
# Since there are no missing values, we will proceed with checking for outliers for the entire dataset
# Each part will address whether or not to keep the outliers for certain vars
# Define a function to return all outliers in each col:
def outliers_func(x): # x is a var
Q1 = np.percentile(x,25)
Q3 = np.percentile(x,75)
IQR = Q3 - Q1
lower_fence = Q1- 1.5*IQR
upper_fence = Q3 + 1.5*IQR
return (x > upper_fence) | (x < lower_fence)
numeric = data.select_dtypes(include='number')
outliers = numeric.apply(outliers_func)
outliers.sum()
# There are only outliers in the "Economic_Impact_Million_USD" varYear 0
Average_Temperature_C 0
Total_Precipitation_mm 0
CO2_Emissions_MT 0
Crop_Yield_MT_per_HA 0
Extreme_Weather_Events 0
Irrigation_Access_% 0
Pesticide_Use_KG_per_HA 0
Fertilizer_Use_KG_per_HA 0
Soil_Health_Index 0
Economic_Impact_Million_USD 177
dtype: int64
# Clean all vars excluding country, region, pesticide use, fertilizer use, adaptation strategies, economic_impact
# No missing values, so no imputing needed
# No outliers in these vars, so no adjustment
# 1) Since we are analyzing trends over the years, we really only want to bin "Extreme_Weather_Events" (explained more in the report)
# 2) Check for any inconsistent data in the vars we are using
# These vars cannot be negative: Extreme_Weather_Events, Irrigation_Access_%, Crop_Yield, Total_Precipitation
# Check for consistent data_types in vars (all floats, ints, strings, etc.) --- checked in "Data" section
# Percentages can't be outside 0-100# Number of categories in "Crop Type"
display(data['Crop_Type'].unique())
print(data['Crop_Type'].nunique())array(['Corn', 'Wheat', 'Coffee', 'Sugarcane', 'Fruits', 'Rice', 'Barley',
'Vegetables', 'Soybeans', 'Cotton'], dtype=object)
10
Extreme_Weather_Events
import seaborn as sns
import matplotlib.pyplot as plt# Must first see if data is skewed to determine whether to use .cut or .qcut
plt.hist(data['Extreme_Weather_Events'])
plt.xlabel('Extreme Weather Events')
plt.ylabel('Frequency')
plt.show()# Skewed left so .qcut will be used
binned_extreme_weather = pd.qcut(data['Extreme_Weather_Events'], q=4, labels=['Low','Medium','High','Very High'],retbins = True)
data_cleaned_1 = data.copy()
data_cleaned_1['Extreme_Weather_Events'] = binned_extreme_weather[0]# There are previously defined ranges:
# 0-59 --> poor/fair
# 60-79 --> good
# 80-100 --> excellent
bins = [0, 59, 79, 100]
labels = ['Poor/Fair', 'Good', 'Excellent']
binned_soil_health_idx = pd.cut(data_cleaned_1['Soil_Health_Index'], bins=bins, labels=labels, include_lowest=True,retbins = True)
data_cleaned_1['Soil_Health_Index'] = binned_soil_health_idx[0]# A) Check for negatives where they shouldn't be
numeric = data.select_dtypes(include='number')
neg_count = (numeric < 0).sum()
neg_count
# All vars are accurate (Temperature should have some negative values)Year 0
Average_Temperature_C 1185
Total_Precipitation_mm 0
CO2_Emissions_MT 0
Crop_Yield_MT_per_HA 0
Extreme_Weather_Events 0
Irrigation_Access_% 0
Pesticide_Use_KG_per_HA 0
Fertilizer_Use_KG_per_HA 0
Soil_Health_Index 0
Economic_Impact_Million_USD 0
dtype: int64
### B) Percentages can't be outside 0-100
range_vals = numeric.min().astype(str) +' - '+numeric.max().astype(str)
range_vals
# All data is accurate
# Also note here that soil_health_index ranges from 30 to 100, extreme_weather ranges from 0-10 and year ranges from 1990 to 2024Year 1990.0 - 2024.0
Average_Temperature_C -4.99 - 35.0
Total_Precipitation_mm 200.15 - 2999.67
CO2_Emissions_MT 0.5 - 30.0
Crop_Yield_MT_per_HA 0.45 - 5.0
Extreme_Weather_Events 0.0 - 10.0
Irrigation_Access_% 10.01 - 99.99
Pesticide_Use_KG_per_HA 0.0 - 49.99
Fertilizer_Use_KG_per_HA 0.01 - 99.99
Soil_Health_Index 30.0 - 100.0
Economic_Impact_Million_USD 47.84 - 2346.47
dtype: object
# For this question, yearly trends will now be compared between regions
# 1) Main task here will be to ensure country names and regions are consistent
# 2) Check for equal time coverage across countries - (if Africa has 1990–2025 but South America has 2000–2025)# 1) Country names and regions are consistent
print('country names:')
print(data['Country'].unique())
# Note here -- United States is listed as USA (good for referencing later)
print('regions:')
print(data['Region'].unique())
# regions describe regions of each country, not regions of the world
# Will need to pair them with the country to standardize
# since there are not that many countries, looking at regions within a country may be more insightful than a global analysiscountry names:
['India' 'China' 'France' 'Canada' 'USA' 'Argentina' 'Australia' 'Nigeria'
'Russia' 'Brazil']
regions:
['West Bengal' 'North' 'Ile-de-France' 'Prairies' 'Tamil Nadu' 'Midwest'
'Northeast' 'New South Wales' 'Punjab' 'North West' 'South East'
'Grand Est' 'Northwestern' 'Siberian' 'Northwest' 'Victoria'
'Nouvelle-Aquitaine' 'South' 'Quebec' 'Southeast' 'Ontario' 'East'
'Pampas' 'Western Australia' 'Volga' 'Maharashtra'
'Provence-Alpes-Cote d’Azur' 'West' 'Central' 'North Central' 'Patagonia'
'Queensland' 'South West' 'British Columbia']
# Pairing each region with its country
display(pd.DataFrame(data.groupby('Country')['Region'].unique()))
# So there are multiple duplicates (ex. North Brazil, North China)
data_cleaned_2 = data_cleaned_1.copy()
# For each region in each observation, we want to pair it with its country
# We can make a function
def region_pair_func(x,y): # x is the country, y is the region
return y + '(' + x + ')'
data_cleaned_2['Region'] = region_pair_func(data_cleaned_2['Country'],data_cleaned_2['Region'])
data_cleaned_2['Region'].head()| Region | |
|---|---|
| Country | |
| Argentina | [Northeast, Northwest, Pampas, Patagonia] |
| Australia | [New South Wales, Victoria, Western Australia,... |
| Brazil | [North, Northeast, Southeast, South] |
| Canada | [Prairies, Quebec, Ontario, British Columbia] |
| China | [North, East, South, Central] |
| France | [Ile-de-France, Grand Est, Nouvelle-Aquitaine,... |
| India | [West Bengal, Tamil Nadu, Punjab, Maharashtra] |
| Nigeria | [North West, South East, North Central, South ... |
| Russia | [Northwestern, Siberian, Volga, Central] |
| USA | [Midwest, Northeast, South, West] |
0 West Bengal(India)
1 North(China)
2 Ile-de-France(France)
3 Prairies(Canada)
4 Tamil Nadu(India)
Name: Region, dtype: object
# 2) Check for equal time coverage across countries - (if Africa has 1990–2025 but South America has 2000–2025)
data_cleaned_2.groupby('Region')['Year'].agg(lambda x: x.max() - x.min())
# it looks like all regions have the same range of years (the max), so there are no yearly differencesRegion
British Columbia(Canada) 34
Central(China) 34
Central(Russia) 34
East(China) 34
Grand Est(France) 34
Ile-de-France(France) 34
Maharashtra(India) 34
Midwest(USA) 34
New South Wales(Australia) 34
North Central(Nigeria) 34
North West(Nigeria) 34
North(Brazil) 34
North(China) 34
Northeast(Argentina) 34
Northeast(Brazil) 34
Northeast(USA) 34
Northwest(Argentina) 34
Northwestern(Russia) 34
Nouvelle-Aquitaine(France) 34
Ontario(Canada) 34
Pampas(Argentina) 34
Patagonia(Argentina) 34
Prairies(Canada) 34
Provence-Alpes-Cote d’Azur(France) 34
Punjab(India) 34
Quebec(Canada) 34
Queensland(Australia) 34
Siberian(Russia) 34
South East(Nigeria) 34
South West(Nigeria) 34
South(Brazil) 34
South(China) 34
South(USA) 34
Southeast(Brazil) 34
Tamil Nadu(India) 34
Victoria(Australia) 34
Volga(Russia) 34
West Bengal(India) 34
West(USA) 34
Western Australia(Australia) 34
Name: Year, dtype: int64
# How do climate change factors correlate with crop yield?
# We can do this within each country since there may be global differences
# Same vars for the previous two questions
# Unrealistic values were checked in question 1
# However, it may be interesting to see how the relationship between climate variables and crop yield differs by crop type.
# For this, we need to OHE crop type
data_cleaned_3 = data_cleaned_2.copy()
# Check to see the categories in crop type
print(data_cleaned_3['Crop_Type'].nunique())
print(data_cleaned_3['Crop_Type'].unique())
crop_type_OHE = pd.get_dummies(data_cleaned_3['Crop_Type'], drop_first= True)
# It can now be used for data analysis/visualization if i so choose10
['Corn' 'Wheat' 'Coffee' 'Sugarcane' 'Fruits' 'Rice' 'Barley' 'Vegetables'
'Soybeans' 'Cotton']
# How do climate change and agricultural practices collectively influence economic impact from crop production globally?
# 1) As we found earlier, there are many outliers in the economic_impact variable, which is introduced in this question
# Filter the outliers
def outliers_func(x): # x is a var
Q1 = np.percentile(x,25)
Q3 = np.percentile(x,75)
IQR = Q3 - Q1
lower_fence = Q1- 1.5*IQR
upper_fence = Q3 + 1.5*IQR
return (x > upper_fence) | (x < lower_fence)
print(outliers_func(data_cleaned_4['Economic_Impact_Million_USD']).sum())
data_cleaned_4 = data_cleaned_3.copy()
data_cleaned_4.loc[outliers_func(data_cleaned_4['Economic_Impact_Million_USD'])]
# When dealing with economic impact, a lot of these outliers may actually be plausible
# However, they may be unrealistic if the crop yield is low and the impact is very high
# Which of these outliers are above Tukey's fences?
def outliers_func(x): # x is a var
Q1 = np.percentile(x,25)
Q3 = np.percentile(x,75)
IQR = Q3 - Q1
lower_fence = Q1- 1.5*IQR
upper_fence = Q3 + 1.5*IQR
return (x > upper_fence)
print(outliers_func(data_cleaned_4['Economic_Impact_Million_USD']).sum())
# All of them are high values (none are below 177
177
outliers = data_cleaned_4.loc[outliers_func(data_cleaned_4['Economic_Impact_Million_USD'])]
# Out of these outliers, which have low crop yields?
print('The lowest Crop Yield of all the outliers is:')
print(outliers['Crop_Yield_MT_per_HA'].min())
print('Average of Crop_Yield:')
print(data_cleaned_4['Crop_Yield_MT_per_HA'].mean())
# So the outliers make sense because all of the data in the outliers df has a crop yield above the average crop_yield
# We keep the outliers because it is plausible that a high economic impact is due to a high crop yieldThe lowest Crop Yield of all the outliers is:
3.56
Average of Crop_Yield:
2.2400169
# 2) To measure economic impact, it may also be useful to look at adaptation strategies
data_cleaned_4['Adaptation_Strategies'].unique()
# There is no cleaning to be done here because there are 5 nicely-grouped categoriesarray(['Water Management', 'Crop Rotation', 'No Adaptation',
'Organic Farming', 'Drought-resistant Crops'], dtype=object)
# There are 3 main categories for analysis:
# 1) Crop yield over time and also per crop type
# 2) Climate variables over time -- temperature, precipitation, CO2 emissions, and extreme weather
# 3) Agricultural factors over time -- soil health index
# irrigation access, pesticide/fertilizer use can be looked at separately to validate SHI# 1a) Crop yield over time
plt.figure(figsize=(12, 6))
sns.lineplot(data = data_cleaned_1, x = 'Year', y = 'Crop_Yield_MT_per_HA')
plt.show()
# For each row of the same month value, the y-value is the average of the 'crop_yield'
# The shaded region is the 95% CI of the each point
# 1b) Crop yield per crop type
plt.figure(figsize=(12,6))
crop_grouped = data_cleaned_1.groupby(['Year','Crop_Type'])['Crop_Yield_MT_per_HA'].agg('mean').reset_index()
sns.lineplot(data = crop_grouped, x = 'Year', y = 'Crop_Yield_MT_per_HA', hue = 'Crop_Type')
plt.show()
# The "per crop type" plot mirrors the overall plot and is a lot less clean
# so not necessary for the report# 2) Climate variables over time -- temperature, precipitation, CO2 emissions, and extreme weather
# For first three vars, we will do subplots
climate_vars = data_cleaned_1.loc[:,['Year','Average_Temperature_C','Total_Precipitation_mm','CO2_Emissions_MT']]
climate_melted = climate_vars.melt(id_vars = 'Year', var_name = 'Climate_Var', value_name = 'Values')
a = sns.FacetGrid(climate_melted, col = 'Climate_Var',col_wrap=1, height=5, aspect=2, sharey=False)
# sharey ensures each has its own y axis
a.map_dataframe(sns.lineplot, x='Year', y='Values')
a.set_titles("{col_name}")
plt.show();
# For extreme weather events, we can do a bar plot
# reminder -- data_cleaned_1['Extreme_Weather_Events'] = binned_extreme_weather[0]
extreme_counts = (data_cleaned_1.groupby(['Year', 'Extreme_Weather_Events']).size().reset_index(name='Count'))
plt.figure(figsize=(14,6))
sns.lineplot(data=extreme_counts, x='Year', y='Count', hue='Extreme_Weather_Events')
plt.title("Trend of Extreme Weather Events by Category")
plt.ylabel("Number of Observations")
plt.show()
# no binned data
plt.figure(figsize=(14,6))
sns.lineplot(data=data, x='Year', y='Extreme_Weather_Events')
plt.show()# Agricultural factors over time -- Soil Health Index
# not binned
plt.figure(figsize=(12, 6))
sns.lineplot(data = data, x = 'Year', y = 'Soil_Health_Index')
plt.show()
#binned
SHI_counts = (data_cleaned_1.groupby(['Year', 'Soil_Health_Index']).size().reset_index(name='Count'))
plt.figure(figsize=(14,6))
sns.lineplot(data=SHI_counts, x='Year', y='Count', hue='Soil_Health_Index')
plt.title("Trend of Soil Health Index by Category")
plt.ylabel("Number of Observations")
plt.show()
sns.histplot(data['Soil_Health_Index'])# Check to see if any impact from other vars
agri_vars = data_cleaned_1.loc[:, ['Year', 'Irrigation_Access_%', 'Pesticide_Use_KG_per_HA', 'Fertilizer_Use_KG_per_HA']]
agri_melted = agri_vars.melt(id_vars='Year', var_name='Agri_Var', value_name='Values')
g = sns.FacetGrid(agri_melted, col='Agri_Var', col_wrap=3, height=3, aspect=1, sharey=False)
g.map_dataframe(sns.lineplot, x='Year', y='Values')
g.set_titles("{col_name}")
g.set_axis_labels("Year", "")
plt.tight_layout()
plt.show()# From the previous question, we saw the greatest trends in "Crop Yield",
# "Extreme Weather Events", and "Soil Health Index".
# However, we will plot temperature changes on the map to see regional differences as a baseline# 1) Plotting temperature
region_pivot = data_cleaned_1.pivot_table(
index = 'Region', columns = 'Country', values = 'Average_Temperature_C', aggfunc = 'mean'
).reset_index()
region_melt = region_pivot.melt(id_vars='Region',var_name = 'Country',value_name = 'Average Temp (C)')
region_melt = region_melt.dropna(subset=['Average Temp (C)'])
a = sns.FacetGrid(region_melt, col = 'Country',col_wrap=3, height=4, aspect = 1,sharex=False)
a.map(sns.barplot,'Region','Average Temp (C)')
a.set_xticklabels(rotation=45)
a.tight_layout()
plt.show();
# title as regional differences in temperature for each country# Now lets look at regional changes:
# use og data to look at numeric changes
# want raw values from data but region names from data_cleaned_2:
raw_values = data[['Crop_Yield_MT_per_HA', 'Soil_Health_Index', 'Year']].copy()
raw_values['Region'] = data_cleaned_2['Region']
raw_values['Country'] = data_cleaned_2['Country']
raw_values = raw_values[['Country', 'Region', 'Year', 'Crop_Yield_MT_per_HA', 'Soil_Health_Index']]
change_df = raw_values.groupby(['Country', 'Region']).apply(
lambda x: pd.Series({
'Crop_Yield_Change': abs(x.loc[x['Year'].idxmax(), 'Crop_Yield_MT_per_HA']
- x.loc[x['Year'].idxmin(), 'Crop_Yield_MT_per_HA']),
'Soil_Health_Index_Change': abs(x.loc[x['Year'].idxmax(), 'Soil_Health_Index']
- x.loc[x['Year'].idxmin(), 'Soil_Health_Index'])
})
).reset_index()
#display((change_df.sort_values(['Crop_Yield_Change'], ascending=False)).head(10))
#gives us the top 10 regions with greatest change in crop yield
#display((change_df.sort_values(['Soil_Health_Index_Change'], ascending=False)).head(10))
#gives us the top 10 regions with greatest change in soil health index
# Concatenate! (for the report)
top_crop = (
change_df.sort_values('Crop_Yield_Change', ascending=False)
.head(10)[['Country', 'Region', 'Crop_Yield_Change']]
.reset_index(drop=True)
)
top_shi = (
change_df.sort_values('Soil_Health_Index_Change', ascending=False)
.head(10)[['Country', 'Region', 'Soil_Health_Index_Change']]
.reset_index(drop=True)
)
# Place them side-by-side
combined = pd.concat([top_crop, top_shi], axis=1)
display(combined)
# we can't do this on extreme events because all regions have discrete values from 0-10| Country | Region | Crop_Yield_Change | Country | Region | Soil_Health_Index_Change | |
|---|---|---|---|---|---|---|
| 0 | Brazil | Northeast(Brazil) | 2.691 | Brazil | South(Brazil) | 54.30 |
| 1 | France | Ile-de-France(France) | 2.549 | Russia | Siberian(Russia) | 49.48 |
| 2 | Argentina | Northeast(Argentina) | 2.521 | France | Nouvelle-Aquitaine(France) | 49.32 |
| 3 | Russia | Northwestern(Russia) | 2.520 | Argentina | Northwest(Argentina) | 45.44 |
| 4 | Brazil | North(Brazil) | 2.069 | Russia | Northwestern(Russia) | 43.93 |
| 5 | Brazil | Southeast(Brazil) | 2.023 | India | Punjab(India) | 42.70 |
| 6 | USA | South(USA) | 1.878 | Argentina | Patagonia(Argentina) | 42.04 |
| 7 | India | West Bengal(India) | 1.701 | India | Maharashtra(India) | 41.52 |
| 8 | Argentina | Northwest(Argentina) | 1.700 | Argentina | Northeast(Argentina) | 35.92 |
| 9 | Russia | Volga(Russia) | 1.673 | Argentina | Pampas(Argentina) | 32.90 |
# First make a palette for each country
countries = change_df['Country'].unique()
palette = sns.color_palette("tab10", n_colors=len(countries))
palette_dict = dict(zip(countries, palette))
# now plot
#top_crop = change_df.sort_values('crop_change', ascending=True).head(10)
plt.figure(figsize=(8,6))
sns.barplot(
x='Crop_Yield_Change',
y='Region',
data=top_crop,
hue='Country',
palette=palette_dict,
dodge=False
)
plt.xlabel('Absolute Change in Crop Yield (MT/HA)')
plt.ylabel('Region')
plt.title('Top 10 Regions with Greatest Crop Yield Change')
plt.legend(title='Country')
plt.tight_layout()
plt.show()
plt.figure(figsize=(8,6))
sns.barplot(
x='Soil_Health_Index_Change',
y='Region',
data=top_shi,
hue='Country',
palette=palette_dict,
dodge=False
)
plt.xlabel('Absolute Change in Soil Health Index')
plt.ylabel('Region')
plt.title('Top 10 Regions with Greatest Soil Health Index Change')
plt.legend(title='Country')
plt.tight_layout()
plt.show()
# Now let's look at extreme weather events
# because they are binned, we want to see which regions have the highest count of
# very high levels
# however, it might also be interesting to see which countries fluctuate the most
# between low and very high levels
# We can use a heatmap for this!
pivot = data_cleaned_2.pivot_table(
index='Country',
columns='Extreme_Weather_Events',
aggfunc='size',
fill_value=0
)
plt.figure(figsize=(10,8))
sns.heatmap(pivot, annot=True, cmap='Reds')
plt.title('Extreme Weather Events by Country')
plt.xlabel('Event Level')
plt.ylabel('Country')
plt.show()
# countries were used instead of regions here simply because there are too many regions
# for the visualization to be clean# Here we want to see how climate variables correlate with crop yield:
# Climate vars --> temp, precipitation, CO2 emissions, and extreme weather events
# from before we found that:
# the first three vars do not vary significantly year-to-year
# USA has the highest extreme weather levels
# For the first three variables, we can do a pairplot to see correlation maybe?
# For the extreme_weather_events, we can focus in on the USA or even Australia# 1) Climate variable correlation
cols = [
'Average_Temperature_C',
'Total_Precipitation_mm',
'CO2_Emissions_MT',
'Crop_Yield_MT_per_HA'
]
sns.pairplot(data_cleaned_3[cols])
plt.show()# Use a heatmap instead
cols = [
'Average_Temperature_C',
'Total_Precipitation_mm',
'CO2_Emissions_MT',
'Crop_Yield_MT_per_HA','Soil_Health_Index'
]
corr = data[cols].corr()
sns.heatmap(corr, annot=True, cmap = 'RdBu', vmin=-1, vmax=1)
plt.title("Correlation Between Crop Yield and Climate Variables")
plt.show();# Exploring temperature against crop yield further:
sns.scatterplot(
data=data_cleaned_3,
x='Average_Temperature_C',
y='Crop_Yield_MT_per_HA'
)
plt.xlabel('Average Temperature (°C)')
plt.ylabel('Crop Yield (MT/HA)')
plt.title('Temperature vs Crop Yield')
plt.show();# What if we look at it for different crop types?
g = sns.FacetGrid(data_cleaned_1, col='Crop_Type', col_wrap=3, height=4)
g.map_dataframe(sns.scatterplot, x='Average_Temperature_C', y='Crop_Yield_MT_per_HA')
g.map_dataframe(sns.regplot, x='Average_Temperature_C', y='Crop_Yield_MT_per_HA', scatter=False)
g.set_titles("{col_name}")
plt.show()
# no trends present# Correlation between extreme weather events and crop yield
usa = data_cleaned_3[data_cleaned_3['Country'] == 'USA']
sns.boxplot(
data=usa,
x='Extreme_Weather_Events',
y='Crop_Yield_MT_per_HA'
)
plt.title("Crop Yield Across Extreme Weather Categories in the USA")
plt.show();
#ALL countries/regions
sns.boxplot(
data=data_cleaned_3,
x='Extreme_Weather_Events',
y='Crop_Yield_MT_per_HA'
)
plt.title("Crop Yield Across Extreme Weather Categories for All Countries")
plt.show();# Numerical data tables:
# USA summary using describe
usa_summary = usa.groupby('Extreme_Weather_Events')['Crop_Yield_MT_per_HA'].describe()
usa_summary['Range'] = usa_summary['max'] - usa_summary['min']
display("USA Crop Yield Summary by Extreme Weather Events")
display(usa_summary)
# All countries summary using describe
global_summary = data_cleaned_3.groupby('Extreme_Weather_Events')['Crop_Yield_MT_per_HA'].describe()
global_summary['Range'] = global_summary['max'] - global_summary['min']
display("\nGlobal Crop Yield Summary by Extreme Weather Events")
display(global_summary)'USA Crop Yield Summary by Extreme Weather Events'
| count | mean | std | min | 25% | 50% | 75% | max | Range | |
|---|---|---|---|---|---|---|---|---|---|
| Extreme_Weather_Events | |||||||||
| Low | 270.0 | 2.196337 | 0.932528 | 0.513 | 1.530 | 2.1105 | 2.8755 | 4.58 | 4.067 |
| Medium | 273.0 | 2.214839 | 0.932913 | 0.459 | 1.512 | 2.1330 | 2.9000 | 4.68 | 4.221 |
| High | 274.0 | 2.201522 | 1.010211 | 0.468 | 1.395 | 2.1120 | 2.8530 | 4.89 | 4.422 |
| Very High | 215.0 | 2.366972 | 0.964928 | 0.570 | 1.658 | 2.3760 | 3.0045 | 4.90 | 4.330 |
'\nGlobal Crop Yield Summary by Extreme Weather Events'
| count | mean | std | min | 25% | 50% | 75% | max | Range | |
|---|---|---|---|---|---|---|---|---|---|
| Extreme_Weather_Events | |||||||||
| Low | 2755.0 | 2.236892 | 0.982763 | 0.468 | 1.45800 | 2.1600 | 2.951 | 4.94 | 4.472 |
| Medium | 2724.0 | 2.259086 | 1.001251 | 0.450 | 1.48875 | 2.1885 | 2.934 | 5.00 | 4.550 |
| High | 2696.0 | 2.224329 | 1.003056 | 0.450 | 1.43100 | 2.1500 | 2.892 | 4.99 | 4.540 |
| Very High | 1825.0 | 2.239447 | 1.010630 | 0.450 | 1.41300 | 2.2140 | 2.934 | 5.00 | 4.550 |
# A couple of subquestions:
# How does economic impact relate to crop yield?
def summary_df(data, group_col, value_col):
summary = data.groupby(group_col)[value_col].agg(['mean', 'std', 'count']).reset_index()
summary['sem'] = summary['std'] / np.sqrt(summary['count']) # standard error
return summary
# How does economic impact differ across countries?
sns.barplot(
x='Country',
y='Economic_Impact_Million_USD',
hue = 'Country',
palette= 'hls',
data=data_cleaned_4,
dodge=False
)
plt.xticks(rotation = 45)
plt.show();
country_summary = summary_df(data_cleaned_4, 'Country', 'Economic_Impact_Million_USD')
print("Economic Impact by Country")
print(country_summary)
# How does economic impact differ across adaptation strategies?
sns.barplot(
x='Adaptation_Strategies',
y='Crop_Yield_MT_per_HA',
hue = 'Adaptation_Strategies',
palette= 'crest',
data=data_cleaned_4,
dodge=False
)
plt.xticks(rotation = 45)
plt.show();
strategy_summary = summary_df(data_cleaned_4, 'Adaptation_Strategies', 'Crop_Yield_MT_per_HA')
print("\nCrop Yield by Adaptation Strategy")
print(strategy_summary)Economic Impact by Country
Country mean std count sem
0 Argentina 674.055935 416.156121 984 13.266573
1 Australia 660.917752 418.982861 1032 13.042367
2 Brazil 675.135339 412.385134 944 13.421993
3 Canada 680.121931 410.546507 984 13.087745
4 China 674.689835 416.854107 1031 12.982393
5 France 663.225501 404.198313 978 12.924837
6 India 683.412673 418.990602 1025 13.087068
7 Nigeria 699.281147 429.024206 1029 13.374394
8 Russia 666.132914 417.119406 961 13.455465
9 USA 665.057074 400.749492 1032 12.474787
Crop Yield by Adaptation Strategy
Adaptation_Strategies mean std count sem
0 Crop Rotation 2.271590 1.012447 1957 0.022886
1 Drought-resistant Crops 2.244425 0.991208 1995 0.022192
2 No Adaptation 2.237939 1.001804 2024 0.022268
3 Organic Farming 2.238188 0.982698 1975 0.022112
4 Water Management 2.209385 1.003231 2049 0.022163
# How does economic impact relate to crop yield?
sns.scatterplot(
x = 'Crop_Yield_MT_per_HA',
y = 'Economic_Impact_Million_USD',
data = data_cleaned_4,
)
plt.show();
#Less dense
sns.regplot(
data = data_cleaned_4.sample(1000, random_state=42),
x = 'Crop_Yield_MT_per_HA',
y = 'Economic_Impact_Million_USD'
)
plt.show();